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Abstract 

The derivation of statistical properties for Partial Least Squares regression 
can be a challenging task. The reason is that the construction of latent compo- 
nents from the predictor variables also depends on the response variable. While 
this typically leads to good performance and interpretable models in practice, 
it makes the statistical analysis more involved. In this work, we study the in- 
trinsic complexity of Partial Least Squares Regression. Our contribution is an 
unbiased estimate of its Degrees of Freedom. It is defined as the trace of the 
first derivative of the fitted values, seen as a function of the response. We estab- 
lish two equivalent representations that rely on the close connection of Partial 
Least Squares to matrix decompositions and Krylov subspace techniques. We 
show that the Degrees of Freedom depend on the collinearity of the predictor 
variables: The lower the collinearity is, the higher the Degrees of Freedom are. 
In particular, they are typically higher than the naive approach that defines the 
Degrees of Freedom as the number of components. Further, we illustrate how 
our Degrees of Freedom estimate can be used for the comparison of different 
regression methods. In the experimental section, we show that our Degrees of 
Freedom estimate in combination with information criteria is useful for model 
selection. 
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1 Introduction 



Partial Least Squares regression (PLSR) (Wold 1975) is a two-step regularized regres- 
sion technique. It iteratively constructs an orthogonal set of latent components from 
the predictor variables which have maximal covariance with the response variable. 
This low- dimensional representation of the data is then used to fit a linear regression 
model. PLSR is extended to nonlinear regression problems via a transformation of 
the predictor variables (Durand and Sabatier 1997, Rosipal and Trejo 2001). 

For model selection in PLSR, the optimal number of components has to be de- 
termined. While cross-validation is the standard approach, an alternative is the use 
of information criteria, which use the complexity of the fitted model. In regression, 
the complexity of a fitting method is defined in terms of its Degrees of Freedom. 
Apart from their usefulness for model selection, Degrees of Freedom also quantify 
the intrinsic complexity of a regression method (see e.g. Van der Voet (1999) for an 
overview). In contrast to other standard regression techniques as Principal Compo- 
nents Regression or Ridge Regression (where the Degrees of Freedom equal the trace 
of the hat-matrix), the derivation of the Degrees of Freedom for PLSR is not straight- 
forward. This is due to the fact that PLSR is not linear in the sense that the fitted 
response does not depend linearly on the response. As the set of latent components 
is constructed in a supervised fashion, the projection of the response variable onto 
these components is a highly nonlinear function of the response. Therefore, it has 
been argued (e.g Martens and Naes (1989), Frank and Friedman (1993)) that the 
Degrees of Freedom of PLSR exceed the number of components. 

We provide an unbiased estimate of the generalized Degrees of Freedom of PLSR. 
It is defined as the trace of the Jacobian matrix of the fitted values, seen as a function 
of the response. We illustrate on benchmark data that the complexity of PLSR 
depends on the collinearity of the predictor variables: The higher the collinearity 
is, the lower the complexity is. Under additional assumptions on the collinearity 
structure of the data, we provide bounds for the Degrees of Freedom if one component 
is used. 

We present two different implementations, (i) The first one is derived via an 
iterative computation of the first derivative of the PLSR fit. To do so, we use the 
equivalence of Partial Least Squares Regression to the Lanczos decomposition of the 
matrix of predictor variables. This implementation has the advantage that it also 
provides an asymptotic distribution of the PLSR regression coefficients, which can 
be used for the construction of confidence intervals (Phatak et al. 2002, Denham 
1997). (ii) The second implementation computes the Degrees of Freedom directly, 
i.e. it avoids the computation of the derivative itself. This leads to a more favorable 
runtime. For the derivation, we use the close connection of PLS regression to Krylov 
subspace techniques. Both algorithms are implemented in the R package 'plsdof 
(Kramer and Braun 2009). 
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We investigate the performance of the Degrees of Freedom of PLSR with respect 
to model selection of the number of PLSR components. We compare the test errors 
based on 10-fold cross-validation and based on the Bayesian Information Criterion 
(BIC) (Schwarz 1978) in a simulation study. For the latter information criterion, we 
use our Degrees of Freedom estimate and the naive approach that defines the Degrees 
of Freedom of PLSR via the number of components. Our experiments show that the 
model selected based on our Degrees of Freedom estimate is typically less complex 
than the model selected by the naive approach. This can lead to a higher test error 
for the naive approach. In terms of prediction accuracy, our Degrees of Freedom 
approach is on a par with the gold-standard of cross-validation, providing further 
evidence that our estimates captures the true model complexity correctly. 



2 Methodological Background 

We consider a multivariate regression problem 

R3Y, = f( Xi ) +s i ,e i ~ AT (0, a 2 ) , (1) 

and the task is to estimate the unknown function / : W — > R from a finite set of n 
examples (xi, yi), . . . , (x n , y n ) GPxl, where ?/j is drawn from (1). We assume that 
the error variables e$ are independent. Let us denote by x and s(x) the mean and the 
standard deviation of the predictor examples x^ and by y the mean of the response 
samples y^. The n x p data matrix X is the matrix whose rows are the centered and 
scaled (to unit variance) Xi, and the vector y £ M™ consists of the centered response 
yi. While in the course of this paper, we assume that the regression function / is 
linear, 

f(x) = /3 (0) + f3 T x, /3 (0) eR,(3 eW , (2) 

the definitions given in Subsection 2.1 do not require / to be a linear function. Finally, 
we define 

S = —^—X T X £ W xp and s = —^—X T y £ W. (3) 
n — 1 n — 1 

2.1 Degrees of Freedom and Model Selection 

Regularized regression methods typically yield a set of estimates fx of the true re- 
gression function /, and the parameter A determines the amount of regularization. 
In Partial Least Squares Regression, the parameter A corresponds to the number of 
latent components. 

The task is to determine the optimal parameter value A. Information criteria 
are based on the rationale that the true error of fx can be estimated in terms of its 
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training error and its complexity. In regression problems, the complexity is defined via 
Degrees of Freedom. These are defined for the class of methods that are linear in the 
sense that the fitted values are a linear function of y, i.e. y\ = H\y with H\ G IR pxp 
a matrix that does not depend on y. Popular examples are Ridge Regression and 
Principal Components Regression. The matrix H\ is called the hat-matrix. In the 
linear case, the Degrees of Freedom are defined as the trace of the hat-matrix, 

DoF(A) = trace (H x ) . (4) 

As we point out below, PLS regression is not a linear method, and the above definition 
cannot be applied. In order to extend the notion of Degrees of Freedom to PLSR, we 
employ the generalized definition proposed by Efron (2004). 

Definition 1. Let f\ be an estimate of the true regression function f, parameterized 
by A. We define the vector of fitted values as y\ = (f\(xx, ),..., f\(x n )) T . The 
Degrees of Freedom are 



DoF (A) = E 



trace 



dyx 
dy 



Here, the input X is assumed to be fixed and the expectation E is taken with respect 
to y u . . .,y n . 

The Degrees of Freedom measure the sensitivity of the fitted values, seen as a 
function of y . Note that for the special case of linear methods, this definition coincides 
with (4). 

Popular examples of information criteria include the Akaike information criterion 
(Akaike 1973), the generalized minimum description length (Hansen and Yu 2001) 
and the Bayesian information criterion (BIC) (Schwarz 1978) 

bic(A) = \\y x -y\\ 2 + \og(n)a 2 BoF(X). 

For information criteria, we need an estimate of the noise level a defined in (1). For 
linear methods y x = H x y, this is accomplished as follows. The residual is of the form 
y — y x = (I n — H\) y. The bias-variance decomposition of the mean squared error 
yields 

E[\\y-yx\\ 2 ] = \\E[y -y x }\\ 2 + trace ((I n -H x )(I n -Hj)) a 2 . 
By dropping the unknown bias term \\E[y — y\]\\ 2 , we yield an estimate of a via 

~2 ll^A-^H 2 (5) 

trace ((J n - H x )(I n - Hj)) ' 
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If H\ is a projection operator (which is e.g. true for Principal Components Regres- 
sion), the expression is simplified to 



- 2 \\y\-yf (R s 

° = n - DoF(A) • (6) 

We note that the latter estimate is most commonly used, even if the above assumption 
is not fulfilled. 



2.2 Partial Least Squares Regression 

PLSR constructs m latent components T = (ti, . . . , t m ) G R nxm from the predictor 
variables X such that the components U are mutually orthogonal and that they have 
maximum covariance to the response y. In the NIPALS algorithm (Wold 1975), the 
first component t\ = Xw± maximizes the squared covariance to the response y, 

||cov(A;«>,y)|| 2 w T X T yy T Xw T 
W\ = argmax = argmax oc X y. 17) 

w W' W w W W 

Subsequent components t 2 ,t 3 ,... are chosen such that they maximize the squared 
covariance to y and that all components are mutually orthogonal. Orthogonality is 
enforced by deflating the original variables X , 

X t = X V, , X (8) 

Here, Vt^...,^^ denotes the orthogonal projection onto the space spanned by t%, . . . , U-i. 
We then replace X by X^ in (7). While the matrix T = (ti, . . . , t m ) is orthogonal 
by construction, it can be shown that the matrix W = (w 1 ,...,w m ) G R dxm is 
orthogonal as well (e.g. Hoskuldsson (1988)). The m latent components T are used 
as regressors in a least squares fit in place of X, leading to fitted values 



V,. 



yl n + T(T T T) 1 T T y = yln + V T y. (9) 



We emphasize again that PLSR is not a linear estimator as defined in Section 2.1. 
The projection matrix Vt depends on the response as well, as the latent components 
T are defined in terms of both X and y. 

To determine the estimated regression coefficients and intercept in (2), we define 

L = T T XW eR mxm , (10) 

and obtain (3 m = DWL x T T y for the regression coefficients (Manne 1987, Hoskulds- 
son 1988) and f3^} = y — x T (3 m for the intercept. Here, D is the diagonal pxp scaling 
matrix with entries da = l/s(cc)j. 
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Table 1: Properties of the three benchmark data sets. A detailed description of the 
data can be found in the appendix. 



data set variables examples mean absolute correlation 



kin (fh) 32 8192 low (0.009) 

ozone 12 203 medium (0.260) 

cookie 700 70 high (0.867) 



3 Unbiased Estimation of the Degrees of Freedom 

The latent components T of PLSR depend on the response y. Therefore, the rela- 
tionship between y and the fitted PLSR values y m is nonlinear, and the compact 
formula (4) for the Degrees of Freedom cannot be applied. However, we can use the 
more general Definition 1 to obtain an unbiased plug-in estimate. 

Proposition 2. An unbiased estimate of the Degrees of Freedom of PLSR with m 
latent components T = (ti, . . . , t m ) is given by 



The constant term 1 corresponds to the estimation of the intercept (3^°\ which 
consumes one Degree of Freedom. For the derivation, we need to compute the trace of 
the derivative in (11) explicitly. We propose two equivalent algorithms in Subsections 
3.3 and 3.4. 

3.1 Illustration and a Lower Bound 

Before delving deeper into the details of the implementation, we illustrate the prop- 
erties of the Degrees of Freedom on benchmark data. An overview of the data sets 
is given in Table 1 (see the appendix for more details). We choose the three particu- 
lar data sets as they differ with respect to the collinearity structure of the predictor 
variables. As an indicator for the degree of collinearity, we compute the mean of the 
absolute empirical correlation coefficients defined by s = (2/ {p 2 —p)) T7i<j \ s ij\ ■ Here 
Sij is the (z, j)-entry of the empirical correlation matrix S of X. The values of s are 
displayed in the fourth column of Table 1 . 

In Figure 1, we plot the Degrees of Freedom for the three data sets as a function 
of the number of components. (For the large data set kin (f h) , we use a subsample 
of size 300.) In addition, we display the naive estimate DoF(m) = m + 1. In the 
examples, our Degrees of Freedom estimate always exceeds the naive approach. This 



DoF(m) 




(11) 
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kin (fh) ozone cookie 




5 10 15 20 25 30 2 4 6 8 10 12 10 20 30 40 50 60 70 

components components components 



Figure 1: Estimated Degrees of Freedom (stars) for the three benchmark data sets. 
The solid line displays the naive estimate DoF(m) = m + 1. If the assumption of 
theorem 3 is fulfilled, we also display the lower bound on the Degrees of Freedom for 
1 component (dashed horizontal line). 



supports the conjecture that DoF(m) > m + 1, which is voiced e.g. in Martens and 
Naes (1989) and Frank and Friedman (1993). 

Furthermore, Figure 1 shows that the correlation structure of the predictor vari- 
ables determines the shape of the DoF-curve. In our examples, the complexity is 
higher for data with low correlation. We underpin this observation with a lower 
bound on the Degrees of Freedom of PLSR with m — 1 component. 

Theorem 3. If the largest eigenvalue A max of the empirical correlation matrix S 
defined in (3) fulfills 

Amax < ^trace(S) , (12) 

then 

DoF(m = l) > 1 + tm . Ce(S) ■ (13) 

''max 

Condition (12) controls the amount of collinearity of X. If the collinearity is low, 
the decay of the eigenvalues of S is slow (and condition (12) is fulfilled). The lower 
bound (13) is higher for data with low collinearity. In Figure 1, we add the lower 
bound for the data sets kin (fh) and ozone, which fulfill condition (12). 

Proof. We express the PLS fit for one component in terms of S and s defined in (3). 
Recall that the first latent component is defined as ti = Xs, which implies 
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After computing the derivative of this term with respect to y and computing its trace, 
we obtain 



DoF(m = 1) 



s T s 



s T Ss 



trace (S) — 2 



(s T S 2 s) 
s T Ss 



Now, by definition of maximal eigenvalues, s T S 2 s/s T Ss < X max and s T s/s T Ss > 

1 / X max ■ It follows that 

, (s T S 2 s) , 

trace(S) — 2 — — > trace(S) — 2A max . 

Condition (12) ensures that the right-hand side of this inequality is non-negative, 
hence, 



DoF(m = 1) > 3 

> 3 



s T s 

1 



[trace(S') — 2A r 



A, 



[trace (5) - 2A max ] = 1 



trace (S) 



□ 



3.2 Comparison of Regression Methods 

While the regularization parameters Ai and A2 for two competing regression methods 
cannot be compared directly, their corresponding Degrees of Freedom values DoF(Ai) 
and DoF(A2) are on the same scale. Hence, Degrees of Freedom allow us to compare 
the model complexity across different regularized regression approaches. 

To illustrate this point, we compare the model complexity of PLSR, Principal 
Components Regression (PCR), and Ridge Regression on the ozone data set. Recall 
that for PCR, the Degrees of Freedom are the number of principal components plus 
one, and for Ridge Regression, 

DoF(A) ridgc = 1 + trace ( X (X T X + XI P ) ^ X T ) . 

We split the data set into 50 training examples and 153 test examples. On the training 
data, we compute the optimal model parameter for the respective methods with 10- 
fold cross-validation. We assess the predictive performance by computing the mean 
squared error of prediction on the test set. This procedure is repeated 50 times. 

Let us start with the observation that the predictive performance of the three 
methods is similar on this data set (left plot in Figure 2). This is in accordance with 
previous observations that the methods often perform similarly (see e.g. Frank and 



S 



mean squared error components degrees of freedom 




PLS PCR RIDGE PLS PCR PLS PCR RIDGE 



Figure 2: Comparison of PLS, PCR and Ridge Regression. Left: mean squared error 
of prediction for the three regression methods. Center: Number of cross-validation 
optimal number of components for PLS and PCR. Right: Cross-validation optimal 
Degrees of Freedom for the three regression methods. 

Friedman (1993)). Next, we compare the model complexity of PLSR and PCR in 
terms of the cross-validation-optimal number of components. This is displayed in 
the center plot of Figure 2. On average, PLSR selects fewer components than PCR. 
However, in terms of Degrees of Freedom, PLSR selects more complex models than 
PCR (right plot in Figure 2). 

Further, it is known (De Jong 1993) that for a fixed number of components, PLSR 
obtains a lower approximation error ||y z s — y m \\ 2 /n than PCR. Here, y a i s are the fitted 
values obtained by ordinary least squares. This implies that for a fixed number of 
components, the training error \\y — y m \\ 2 /n of PLSR is lower than the one of PCR. 
This is illustrated in the left plot of Figure 3. Here, we plot the mean training error 
for PLSR and PCR over all 50 runs of the experiments as a function of the number 
of components. 

The lower approximation error is sometimes used as an argument in favor of 
PLSR, as it often leads to the selection of fewer components compared to PCR (as 
illustrated in Figure 2). While this can be advantageous in problems where the latent 
components are used for model interpretation (e.g. for the visualization of the data 
in terms of a 2-D or 3-D plot), we stress again that the number of components do 
not capture the intrinsic model complexity, and that models with the same Degrees 
of Freedom should be compared instead. In the right plot of Figure 3, we therefore 
plot the mean training error of PLSR and PCR as functions of the corresponding 
Degrees of Freedom. The plot shows that the difference of the approximation error 
is marginal in terms of Degrees of Freedom. Furthermore, the plot illustrates that 
PLSR concentrates a lot of components on regions with low training error and high 
complexity: All PLS models (with m > 1) consume at least 4 Degrees of Freedom, 
and for m > 3, the Degrees of Freedom already exceed 10. Compared to PCR, this 
might be disadvantageous in situations where the true underlying model has a low 
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complexity, as PLSR models do not explore the full range of Degrees of Freedom. 




2 4 6 8 10 12 2 4 6 8 10 12 

components degrees of freedom 



Figure 3: Training error of PLSR and PCR. Left: Training error as a function of 
the number of components. Right: Training error as a function of the Degrees of 
Freedom. 

To summarize, the direct comparison of model parameters for different regression 
methods is in general not possible, because they are either on a different scale (e.g. 
for PLSR and Ridge Regression) or they are just seemingly on the same scale (e.g. for 
PLSR and PCR) and a direct comparison would lead to misleading results. Degrees 
of Freedom offer a valuable solution to this problem. 

3.3 First Derivative of the Lanczos Decomposition 

We now present the first implementation of the Degrees of Freedom estimate. We 
extend the approaches that are proposed in Denham (1997) and Serneels et al. (2004). 
There, the iterative formulation of the NIPALS algorithm is used to construct the 
derivative of j3 m . Instead, our algorithm is based on the matrix decomposition defined 
by (10). Note that the matrix L is upper bidiagonal, i.e. kj = for i > j or i < j — 1. 
The relationship (10) defines a Lanczos decomposition (Lanczos 1950) of X, i.e. a 
decomposition into orthonormal matrices T and W and an upper bidiagonal matrix 
L. For fixed y and m, the decomposition is unique. The Lanczos decomposition can 
be interpreted as the analogue to the singular value decomposition that is defined by 
Principal Components Analysis. For more details on the equivalence of PLSR and 
Lanczos decompositions, we refer the readers to Elden (2004). 

We propose an implementation of PLSR that iteratively constructs the derivative 
of the projection operator Vt based on the Lanczos decomposition. This has the 
advantage that we also obtain the derivative of the regression coefficients f3 m , which 
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can then be used to construct approximative confidence intervals (Denham 1997, 
Phatak et al. 2002). 

We proceed in three steps. First, we derive a fast recursive PLSR algorithm based 
on the Lanczos decomposition. This algorithm avoids the explicit deflation of X as 
in (8), and only depends on projections onto one-dimensional subspaces. Second, we 
determine the first derivative of one- dimensional projection operators (Kramer and 
Braun 2007), which are 

d(v/\\v\\ s ) 9 ( v /^ vTSv ) 1 / _ vv T S \ dv 
dy dy \\ v \\s V ™ vTv J dy ' 

and 

d (vv T z) . T T dv T dz 

—4 '- = (vz T + v T zI n ) — + VV — . 

ay v 'ay ay 

Finally, we differentiate the recursive formulas of the Lanczos representation. Algo- 
rithm 1 displays the result. Its derivation can be found in the appendix. 



Algorithm 1 Derivative of the regression coefficients and Degrees of Freedom 

1: Input: centered and scaled data X, y, number m of components 

2: n = number or examples 

3: S = (X T X) /(n -1),8= (X T y) /(n - 1) 

4: Initialization: (3 = P , (df3 /dy\ = pxn 

5: for i — 1, . . . , m do 

6: Wi = S — 

7: (dWi/dy) = X T /(n - 1) - S (d%_Jdy) 

8: Vi = Wi- J2 l j=\ VjVjSWi 

9: (dvi/dy) = (dWi/dy) - ^=1 (dvjvj Swi/dy) 

10: Vi = y/n - r«i/||V j||5 

ll: ^ (dvi/dy) = y/n-ld («i/||«i||s) /dy 

12: 3i = 3i-l + ViVjs 

13: dfijdy = d%_Jdy + (dVivJs/dy) 

14: end for 

15: DoF(m) = 1 + trace (xd$ m /dy) . 



As we compute the derivative of the regression coefficients as well, we can estimate 
the covariance of the PLSR coefficients by using a first order Taylor approximation 
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j3 m ~ \ d(3 m /dy \ y, which leads to 



dov(3 m ) = a 



2 d @ m f ^Prn 

dy \ dy 



T 



Furthermore, we can use the first order Taylor expansion to construct an approximate 
hat-matrix for PLSR via 

fm,approx = ~Qy~ ' 

This matrix can be plugged into formula (5) for the estimation of the noise level. 



3.4 Trace of the Krylov Representation 

The computation of the derivative of y m in Subsection 3.3 involves repeated matrix- 
matrix-multiplications. For high- dimensional data, this can become very time-consuming 
As we do not need the derivative itself for the Degrees of Freedom but only its trace, 
we reduce the computational load by cleverly rearranging the computation of the 
derivative. 

To this end, we use a closed form expression of the fitted values y m that is based 
on Krylov subspaces. We use the fact (Hoskuldsson 1988) that span{ti, . . . ,t m } = 
span {Ky, . . . , K m y} =: JC m , with K = XX T the n x n kernel matrix. The space 
on the right-hand side is called the Krylov subspace defined by K and Ky. We 
use the explicit representation y m = y + Vic m to compute its derivative. In Phatak 
et al. (2002), a corresponding formula for (3 m is used to determine its approximate 
distribution. We extend this result to the derivative of the fitted values. Additionally, 
after computing the derivative, we apply the basis transformation B = ((tj, K^y)) e 
]R mxm to improve numerical stability. This yields the following result. 

Proposition 4. Set 

c = B~ l T T y G M. m and V = (v u . . . , v m ) = T G R nxm . 



We have 
dy n 



m m 

-I n + cj {I n ~ TT T ) K + Vj (V - Vm? K + TT 1 



dy n 

In contrast to the Lanczos representation (Subsection 3.3), the representation in 
Proposition 4 is more convenient for the computation of the Degrees of Freedom, as 
its trace can be computed directly. 
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Proposition 5. The unbiased estimate for the Degrees of Freedom of PLSR with m 
components equals 

mm m 

DoF(m) = 1 + trace (.ft?) - ^ti T KHi + {y - y m ) T ^K j Vj + m . 

3=1 j,l=l j=l 

Hence, for the computation of the Degrees of Freedom of PLSR, we need a single 
fit of the PLSR algorithm that returns the matrix T of latent components. One can 
either use the original formulation of the NIPALS algorithm (Subsection 2.2) or the 
Lanczos decomposition (Algorithm 1) without the computation of the derivative. 



4 Experiments 

In this section, we evaluate the performance of the Degrees of Freedom estimate with 
respect to model selection for PLSR. We investigate the performance both with re- 
spect to prediction accuracy and model complexity (Subsection 4.1), and with respect 
to the estimation of the noise level a of the regression model (Subsection 4.2). Fur- 
ther, in Subsection 4.3, we discuss numerical issues and the computational efficiency 
of our proposed algorithms. 

In our experiments, we consider the Bayesian Information Criterion. We con- 
ducted experiments with the Akaike Information Criterion and the generalized Min- 
imum Description Length as well, but we found that our main observations on 
the difference between our Degrees of Freedom estimate and the naive approach 
DoF(m) = m + 1 do not depend on the particular criterion. Hence, for the sake 
of clarity, we only report the results for the Bayesian Information Criterion. 

We would like to stress that the primary goal of this section is not to advertise 
BIC in combination with our Degrees of Freedom estimate as a novel model selection 
criterion that is universally preferable to existing methods. Instead, we want to 
provide further empirical evidence that our estimate captures the intrinsic model 
complexity of PLSR reliably. 

We use the ozone data set as a starting point for our simulation study. As a 
preprocessing step, we scale all variables of the data set to lie in the interval [-1,-1-1] . 
The true regression function / in equation (1) is of the form 

d 

fix) = s ^ / f3j(j)j{x) with fixed basis functions 4>j{x) = exp (— \\x — Cj\\ 2 ) . 

3=1 

The coefficients of the center points Cj e M p are chosen uniformly from [—1, +1]. The 
regression coefficients f3j are chosen randomly from a uniform distribution over [1,3]. 
We stress that the center points and hence the basis functions are fixed a-priori. This 
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is to ensure that we are still in a parametric regression scenario, for which the Bayesian 
Information Criterion is suited. In the simulation study, we vary the number d of basis 
functions from 10 to 210 in steps of 40. We choose the variance a 2 of the noise variable 
such that the signal-to-noise-ratio equals 9. After the transformation of the initial 
data matrix X via the basis functions (0i, . . . , (pa), we obtain a d- dimensional data 
set, and are in the setting of a linear regression model. 

We split the data into 50 training and 153 test points. The small training sam- 
ple size allows us to consider high-dimensional settings, and the large test sample 
size ensures a reliable estimation of the test error. On the training data, we apply 
four different model selection criteria, (a) CV: 10-fold cross-validation, (b) LANC- 
ZOS: Bayesian Information Criterion with the Degrees of Freedom computed from 
the Lanczos decomposition (Algorithm 1). For the estimation of the noise level, we 
use equation (5) with the approximate hat-matrix defined in (14). (c) KRYLOV: 
Bayesian Information Criterion with the Degrees of Freedom computed from the 
Krylov representation (Proposition 5). For the estimation of the noise level, we use 
equation (6). (d) NAIVE: Bayesian Information Criterion with the naive Degrees of 
Freedom DoF(m) — rn+1. For the estimation of the noise level, we use equation (6). 
Note that LANCZOS and KRYLOV use the same Degrees of Freedom estimate, and 
only differ in the estimation of the noise level a . As the computation of the Degrees 
of Freedom depends on two different implementations, their runtime differs. Further, 
for all four methods, we set the range of the number of components from to 30. 

For each of the four criteria, we measure the performance on the hold-out set of size 
153 by computing the normalized mean squared error: We divide the mean squared 
test error by the mean squared test error of the trivial model, i.e. the constant model 
equal to the mean of the training data. This normalization facilitates the comparison 
between different values of d. The procedure is repeated 50 times. 

4.1 Prediction Accuracy and Model Complexity 

We display the median and the boxplots of normalized test errors in Figure 4. On 
average, our Degrees of Freedom estimate in combination with BIC shows a compara- 
ble prediction accuracy to cross-validation (right plot of Figure 4). The two different 
approaches for the computation of the Degrees of Freedom (KRYLOV and LANC- 
ZOS), which only differ in the estimation of a, do not show any clear difference over 
the different simulation settings. Note however that LANCZOS is in general compu- 
tationally more expensive. We refer the readers to Subsection 4.3 for more details on 
computation time. 

For the higher- dimensional scenarios (d > 90) in our simulation study, the naive 
approach leads to considerably worse results than the three other methods (left plot 
in Figure 4). Figure 5 illustrates that the naive approach tends to underestimate 
the true underlying complexity of the model. Compared to cross-validation and our 
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Figure 4: Simulation results. Left: Median of the normalized test error for the four 
different model selection criteria. Right: Boxplot of the normalized test error for all 
methods except for the naive approach. 

DoF estimate, it selects more components and Degrees of Freedom respectively. A 
closer look at the simulation reveals that the selection of more components does not 
automatically lead to a higher test error. More precisely, for d = 50 dimensions, the 
naive approach selects considerably more components compared to the three other 
methods, yet the prediction accuracy is on the same level. In contrast, for higher 
number of dimensions, the increased model complexity also affects the prediction 
accuracy. A possible explanation for this phenomenon is the following. For many 
moderate sized data sets, the test error first decreases sharply with the number of 
components, and then reaches a flat plateau for higher number of components. In 
this case, a more complex model can lead to a comparable prediction accuracy even 
with higher number of components. To underpin this point, we plot a scaled test 
error for d = 50 and d = 210 as a function of the number of components (see Figure 
6). Here, we scale the test error such that its minimal value for a fixed d is equal to 1. 
For d = 50, the relative decrease in prediction accuracy is only moderate if we choose 
too many components, and the test error curve is flat. In contrast, for d = 210, the 
relative increase of the test error is steep, and a selection of more components than 
the test-error optimal ones can lead to poor results. 
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Figure 5: Simulation results. Left: Number of selected components. Right: Degrees 
of Freedom of the selected model. 



4.2 Estimation of the Noise Level a 

We now investigate the quality of the three different PLS based estimates for the 
noise level a that are obtained by KRYLOV, LANCZOS and NAIVE. 

Figure 7 displays normalized estimates a obtained on the 50 training samples: We 
divide each estimate by the true noise level a that is determined by the signal-to- 
noise ratio (which is set to 9). The dashed horizontal line indicates the theoretical 
optimum of 1. We observe that KRYLOV and LANCZOS slightly overestimate the 
noise level, which would in turn lead to slightly too conservative confidence intervals 
when used in inference problems. In contrast, the naive approach underestimates a, 
and due to the complex models that it selects for higher dimensions (see Subsection 
4.1), the quality of the estimate deteriorates. 



4.3 Numerical Stability and Runtime 

As explained in Subsection 3.3, the sparse structure of the matrix L defined in (10) 
allows us to derive a fast iterative algorithm for PLSR and its derivative (Algorithm 
1). In practice, we observe that the sparsity leads to numerical problems: After a 
certain number of components, the latent components ti are not mutually orthogonal 
anymore. This typically affects the computation of the Degrees of Freedom as well 
and leads to implausible results (e.g. negative Degrees of Freedom). In Kramer 
and Braun (2007), the sparse structure of L is used and an additional stopping 
criterion is imposed to ensure that the latent components are orthogonal. However, 
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Figure 6: Illustration: Scaled mean test error as a function of the number of com- 
ponents. For each value of d (dimensionality), the test error is scaled such that its 
minimum value is 1. 



this algorithm can stop very early. Therefore, we use equation (16) in the appendix 
that requires little additional computation time but ensures stability. 

In some of the data, we observe that for a rather large number of components, both 
implementations for the Degrees of Freedom return negative Degrees of Freedom. This 
indicates a numerical problem. Therefore, in our experiments, we set the maximum 
number of components to m* if we observe negative Degrees of Freedom for + 1 
components. 

Finally, in Figure 8, we compare the runtime of the four different methods in 
our simulation study. While the absolute values are low, the four criteria show clear 
differences. The Lanczos decomposition is by far the slowest approach, as it first 
computes the derivative of the PLSR fit before computing the Degrees of Freedom. 
With respect to runtime, the algorithm based on the Krylov representation is therefore 
preferable, if the explicit derivative is not needed. It is also faster than 10-fold cross- 
validation. The naive approach is always the fastest method, as it only requires a 
single run of the PLSR algorithm and no additional computation of the Degrees of 
Freedom. 
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Figure 7: Simulation results. Normalized estimated noise levels a /a. The dashed 
horizontal line indicates the theoretical optimum. 



5 Discussion and Conclusion 

Our findings show that typically, the Degrees of Freedom are higher for data sets 
with predictor variables that have low correlation and that each PLSR component 
consumes more than one Degree of Freedom. This confirms the longstanding conjec- 
ture that DoF(m) > m + 1. This result may not come as a big surprise: For a fixed 
number of components, PLSR is less biased than Principal Components Regression 
(De Jong 1993). This decrease in bias is balanced by an increased complexity in terms 
of Degrees of Freedom. 

On average, the Degrees of Freedom of PLSR in combination with information 
criteria yield a similar prediction accuracy when compared to cross-validation. The 
naive approach that defines the Degrees of Freedom as the number of components 
plus one selects more complex models, which can in turn lead to worse prediction 
accuracy. This also effects the estimation of the noise level a of the regression model. 
While our approach slightly overestimates a, the naive approach yields estimates 
that are considerably biased downwards. From the computational point of view, 
the implementation based on the Krylov representation is preferable to the Lanczos- 
based algorithm, and - depending on the number k of splits - is faster than /c-fold 
cross-validation. 

In this paper, we applied the Degrees of Freedom estimate to the selection of the 
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Figure 8: Simulation results. Left: Median runtime in seconds for the four different 
methods. Left: Median runtime in seconds for all methods except for LANCZOS. 

optimal number of PLSR components. It is possible to extend our framework to 
penalized PLSR (Goutis and Fearn 1996, Reiss and Ogden 2007, Kramer et al. 2008), 
where an additional smoothing parameter has to be selected. The derivation of the 
Degrees of Freedom can be adapted accordingly. 

The two implementations for the Degrees of Freedom capitalize on the close con- 
nection between PLSR and methods from numerical linear algebra, namely the Lanc- 
zos decompositions and Krylov subspace approximations. Apart from the computa- 
tional advances that are pointed out in this paper, this connection is very fruitful to 
analyze statistical properties of PLSR in a concise way. Recent results on the cor- 
respondence of penalized PLSR to preconditioning (Kramer et al. 2008) and on the 
prediction consistency of PLSR (Blanchard and Kramer 2010a;b) underpin the po- 
tential of this connection. We strongly believe that the interplay between numerical 
linear algebra and PLSR will further stimulate the field of statistics. 
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A Derivation of Algorithm 1 

The weight vector u>i can be rewritten as u>i = X T (y — f/j_i) oc s — Sfi^ . We 
define the "pseudo" -weight vector Vi via tj = XiWi =: Xvi . Using the fact that the 
matrix L = T T XW defined in (10) is upper-triangular, we yield 

Vi = w { - v^vJ^Swi (15) 

i-2 

— ^ VjvJ Swi . (16) 
i=i 

Note that L is in fact also upper-diagonal, hence the term in second line is 0. How- 
ever, to ensure numerical stability, we include the term in our computation. The 
normalization of £j to unit length corresponds to 



y/n — 1 \fn — 1 
~n — n — v i = — , Vj ■ 
IK lis JvJSvi 



It follows that /3j = /3j_ x + v { vj s . 



B Description of the Data Sets 

kin (f h) Simulation of the forward dynamics of an eight link all-revolute robot arm. 
The 32 predictor variables correspond to positions of joints and to twist angles, length 
and offset distance for links. The task is to predict the distance of the end-effector 
from a target. The problem is fairly linear (f) and contains a high amount of noise (h). 
The data is available at the delve-repository http : //www. cs . toronto . edu/~ delve/. 

ozone Los Angeles ozone pollution data 1976. The 12 predictor variables contain 
the date of the measurement and information on wind speed, humidity, temperature 
etc. The task is to predict the daily maximum one-hour-average ozone reading. The 
original data contains missing values. From the 366 examples, we use the 203 examples 
with no missing values. The data is provided by the R-package 'mlbench' (Leisch and 
Dimitriadou 2010). 
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cookie Quantitative NIR spectroscopy for dough piece. A Near Infrared reflectance 
spectrum is available for each dough piece. The spectral data consist of 700 points 
measured from 1100 to 2498 nanometers (nm) in steps of 2 nm. The task is to predict 
the percentage of fat. The data is first analyzed in Brown et al. (2001), Osborne et al. 
(1984). The data set provided by the R-package 'ppls' (Kramer and Boulesteix 2009) 
also contains the percentage of sucrose, dry flour, and water. 
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